Module 12: Spatial Data with Simple Features

Author

Joshua French

The Quarto file used to generate the html file can be obtained by clicking on the Code Links beneath the Table of Contents which will open the Math 3376 Posit Cloud Project where you can open the file 12-Spatial-Data-with-Simple-Features.qmd.

Introduction


This document is intended to help explore how to perform basic manipulation and visualization of spatial data that can be represented using simple features. We will primarily focus on using the sf package for our analysis.

Wrangling and Visualizing Spatial Data in R


Spatial data can take many forms.

For a data scientist, spatial data is usually data for which each observation is geographic region or location.

Simple Features (officially Simple Feature Access) is a set of standards that specify a common storage and access model of geographic feature made of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems. It is formalized by both the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). Wikipedia

Simple features are comprised of:

  1. A geometry object (e.g., a point, a polygon, a line, etc.)
  2. Attribute data associated with the geometry object (e.g., the temperature at a certain time on a certain day, the number of new cases of a disease in a county in the last month).

Simple Features


As stated in the vignette to the sf package:

Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. The geometry of a tree can be the delineation of its crown, of its stem, or the point indicating its centre. Other properties may include its height, color, diameter at breast height at a particular date, and so on.

Just to clarify:

  • The geometry of an observation describes where the object is located.
    • A geometry can be a point, a polygon (which is really just an ordered collection of points), or something more complicated.
  • The attributes of a geometry object are what data scientists would usually consider the “data”.

All geometries are made of points, which can be combined to create more and more complex objects. A point can be 2-, 3-, or 4-dimensional.

The most common kinds of points are two-dimensional and are described using a 2-dimensional set of coordinates (X and Y), e.g., longitude and latitude or easting and northing.

  • 2-dimensional points are referred to as XY points.

A 3-dimensional point could include the X and Y coordinates as well as the altitude of the 2-dimensional point in 3-dimensional space.

  • The Z coordinate is another dimension denoting altitude of the point.
  • Combining a Z coordinate with an XY coordinate results in a 3-dimensional XYZ point.

Alternatively, a 3-dimensional point could include some other measure associated with the point.

  • The M coordinate is another dimension denoting some measure associated with the point.
  • It’s pretty rare, but examples include time or measurement error.
  • Combining an M coordinate with an XY coordinate results in a 3-dimensional XYM point.

Combining X, Y, Z, and M coordinates results in a 4-dimensional point.

Packages and Tools for Working with Simple Features


In what follows, we will use several packages.

  • The sf package is an R package for working with simple features (sf objects) both in terms of the geometry objects and the associated attributes.
    • The sf package can import, manipulate, and plot sf objects.
    • The sf package is intended to supersede the sp package, which is an older R package for working with spatial data.
    • Since the sp package is being superseded by the sf package, I recommend working with the sf package for spatial data analysis moving forward.

Because of how sf objects are represented in R, simple features (once constructed or imported) can be manipulated and plotted by many other well-known packages.

  • The dplyr package can be used to manipulate sf objects, and we will utilize it as certain times.
  • The ggplot2 package can also be used to plot sf objects, which we will demonstrate.

Choosing a color palette to represent the attributes of our simple features is very important.

  • The grDevices package included with base R provides many color palettes.
    • The traditional palettes are rainbow, heat.colors, terrain.colors, topo.colors, and cm.colors.
    • The hcl.colors palette function was added in R 4.0.0 and provides numerous excellent color palettes.

hcl color palette
  • Running the Examples in the documentation found in ?grDevices::hcl.colors will show examples of the available palettes through grDevices.

  • We provide color swatches for many palettes at the end of this tutorial (without the code).

  • The colorspace package provides many of the same HCL palettes available through the hcl.colors palette, but it also provides convenient functions for accessing this in ggplot2.

Loading Required Packages


We load all of these packages below, with the exception of colorspace, as it has a coords function that masks (i.e., replaces) a needed function in the sf package. So we will access the necessary colorspace function using the :: approach.

# library(colorspace)
library(sf)
library(dplyr)
library(ggplot2)

Simple Feature Geometry Objects


The most common simple feature geometry objects used by data scientists are:

geometry function description
POINT sf::st_point A geometry containing a single point
POLYGON sf::st_polygon A geometry with a sequence of points that form a closed ring that doesn’t intersect itself. Multiple rings form outer rings and holes within the polygon.

We go through the process of creating and plotting these geometries below.

# create an XY point
(p1 <- st_point(c(1,2)))
# create an XYZ point
(p2 <- st_point(c(1,2,3)))
# the points look the same when plotted in two dimensions
plot(p1)

# 3-dim point plotted in 2-dim
plot(p2)

  • The polygon outer starts at \((0,0)\), then goes to \((10,0)\), followed by \((10,10)\), then \((0,10)\), and finally stop back at \((0,0)\)
  • The polygon hole1 starts at \((1,1)\), then goes to \((1,3)\), followed by \((2,2)\), then \((2,1)\), and finally stop back at \((1,1)\)
  • The polygon hole2 starts at \((5,5)\), then goes to \((5,6)\), followed by \((6,6)\), then \((7,5)\), and finally stop back at \((5,5)\)
# create a ring (connected set of points)
outer <- matrix(c(0, 0, 10, 0, 10, 10, 0, 10, 0, 0), ncol = 2, byrow = TRUE)
# create additional rings for holes
hole1 <- matrix(c(1, 1, 1, 3, 2, 2, 2, 1, 1, 1), ncol = 2, byrow = TRUE)
hole2 <- matrix(c(5, 5, 5, 6, 6, 6, 7, 5, 5, 5), ncol = 2, byrow = TRUE)
# combine rings into a list
pts <- list(outer, hole1, hole2)
# turn list of rings into a polygon
(pl1 <- st_polygon(pts))
# plot polygon
plot(pl1)

Question 1


Use the code cell below to construct and display a pentagon that has a triangular hole inside.

Solution to Question 1


# create polygon

Common Geometry Objects


The other common geometry objects are:

geometry function description
LINESTRING sf::st_linestring A sequence of points that is connected with straight, non-self intersecting lines
MULTIPOINT sf::st_multipoint A set of points
MULTIPOLYGON sf::st_multipolygon A set of polygons
MULTILINESTRING sf::st_multilinestring A set of line strings
GEOMETRYCOLLECTION sf::st_geometrycollection A set of the other geometries (except itself)

Examples of Geometry Objects


We provide examples of creating and plotting these geometries below.

# create a matrix of multiple points
(pts <- matrix(rnorm(10), ncol = 2))
           [,1]       [,2]
[1,] -0.2942841  0.4042901
[2,] -0.5631947 -2.8190104
[3,]  0.3776016 -1.3146976
[4,] -0.1430829  1.4013844
[5,]  0.2235321 -1.9659875
# convert matrix of points to linestring
(ls1 <- st_linestring(pts))
plot(ls1)

# convert matrix of points to multipoints
(mp1 <- st_multipoint(pts))
plot(mp1)

# create multipolygons
pol1 <- list(outer, hole1, hole2)
pol2 <- list(outer + 24)
mp <- list(pol1, pol2)
(mpl1 <- st_multipolygon(mp))
plot(mpl1, axes = TRUE)

# create a multilinestring
(pts2 <- matrix(rnorm(6), ncol = 2))
         [,1]       [,2]
[1,] 1.189007 -1.0965091
[2,] 1.721922 -0.7875281
[3,] 0.918795  0.6693628
(ml1 <- st_multilinestring(list(pts, pts2)))
plot(ml1)

# create a geometry collection
(gc <- st_geometrycollection(list(p1, pl1, ls1)))
plot(gc, col = "grey")

How Do I Know What Type of Geometry I Need?


  • Often, this is determined automatically when reading in shapefiles (which we’ll discuss later).
  • Attributes observed at a single location require a POINT.
  • Most regions can be represented by a POLYGON.
  • Complicated objects made of regions, e.g., an island chain like Hawaii, require MULTIPOLYGONS.
  • The other types are for more complicated objects.
  • There are 10 other rarer geometry types that we do not discuss (CIRCULARSTRING, CURVE, SURFACE, TRIANGLE, COMPOUNDCURVE, CURVEPOLYGON, MULTICURVE, MULTISURFACE, POLYHEDRALSURFACE, TIN). We can learn about them through the additional resources provided at the end of this document.

Coordinate Reference Systems


A coordinate reference system (CRS) must be provided in order to place a point on the earth’s surface.

When we import a geometry object from file, the CRS will often be provided.

  • The WGS84 CRS is often the default for longitude/latitude coordinates.

Sometimes, in order to combine geometry objects, we must specify the CRS of a geometry object.

There are many CRSs. A CRS is often used because it is known to have a certain desirable property. A discussion of CRSs is beyond the scope of this tutorial. And when you do need to know about CRSs, it will probably be so specific that a general discussion won’t help. However, here are a few references related to CRSs that may provide a bit more detail:

Constructing Simple Feature (sf) Objects


An sf object is a data.frame that has a simple feature geometry list column (i.e., a column of geometry objects). So we can work with sf objects similar to how we would work with a data.frame object, though it may have different default behaviors because it has been extended to an sf object.

  • The geometry-list column contains the geometry object for each observation.
  • The other columns of the sf object contain the attributes of the geometry object.
  • Each row of the sf object is a simple feature. Alternatively, each observation is a simple feature.

In practice, sf objects are often initially created by reading in a shapefile (more on that later). However, particularly for POINT observations, we may need to create an sf object manually.

In what follows, we create sf objects for POINT geometry objects and POLYGON geometry objects.

  • We can use the same previously discussed functions (e.g., sf::st_point, sf::st_polygon, etc.) to create the geometry objects.
  • The sf::st_sfc function is used to create a geometry list column.
  • The sf::st_sf function is used to combined a data.frame of attributes with the geometry list column.
# create POINT objects
pt1 <- st_point(c(0, 0))
pt2 <- st_point(c(0, 1))
pt3 <- st_point(c(1, 1))
# create geometry list column
glc1 <- st_sfc(list(pt1, pt2, pt3))
# is glc1 a list?
is.list(glc1)
[1] TRUE
# what class is glc1
class(glc1)
[1] "sfc_POINT" "sfc"      
# create attribute data.frame with temperature and rainfall attributes
df1 <- data.frame(temperature = c(10, 11, 10.4), rainfall = c(1, 1.3, 0.9))
# create sf object
sf1 <- st_sf(df1, geometry = glc1)
# class of sf1
class(sf1)
[1] "sf"         "data.frame"
# plot sf1
plot(sf1["rainfall"], pch = 20, axes = TRUE)

# create polygon objects
outer1 <- matrix(c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0), ncol = 2)
pl1 <- st_polygon(list(outer1))
# outer2 is outer1 shifted 1 unit to the right
outer2 <- matrix(c(1, 1, 2, 2, 1, 0, 1, 1, 0, 0), ncol = 2)
pl2 <- st_polygon(list(outer2))
# create geometry list columns
glc2 <- st_sfc(list(pl1, pl2))
# what class is glc2
class(glc2)
[1] "sfc_POLYGON" "sfc"        
# create second sf object (only include an attribute column and geometry)
sf2 <- st_sf(cases = c(10, 12), geometry = glc2)
# class of sf2
class(sf2)
[1] "sf"         "data.frame"
# plot sf2
plot(sf2)

# what happens if you combine geometry types?
glc3 <- st_sfc(list(pt1, pt2, pt3, pl1, pl2))
# what class is glc3
class(glc3)
[1] "sfc_GEOMETRY" "sfc"         
# create sf3
sf3 <- st_sf(attribute1 = rnorm(5), geometry = glc3)
# class of sf3
class(sf3)
[1] "sf"         "data.frame"
# plot sf3
plot(sf3, pch = 20, pal = topo.colors)

# plot sf3 with even breaks
plot(sf3, 
     pch = 20, 
     pal = topo.colors, 
     breaks = seq(min(sf3$attribute1), max(sf3$attribute1), length = 12)
     )

Importing Shapefiles as sf Objects


A data scientist is most likely to work with sf objects obtained from importing a shapefile into R.

ArcGIS defines a shapefile as:

A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas). What is a shapefile?

Generally, a shapefile can be imported into R as an sf object.

  • Each row is an observation related to a geometry object.
  • There should be a geometry column that contains the geometry object for each observation (this is the geometry list column).
  • The other columns will represent the attributes associated with each geometry object.

Shapefiles are widely available for describing many different spatial objects like counties, census tracts, zip code tabulation areas (ZCTAs), states, countries, etc. There are even shapefiles that can be used to describe other spatial objects like roads, rivers, lakes, etc.

  • A simple web search with appropriate terms will usually bring up a website with a relevant shapefile for our data.
  • e.g., search “usa shapefile” brings up a Census bureau page with many different shapefiles for different areas and characteristics of the United States.
  • A “shapefile” is often a zipped folder that contains many files inside it.
  • The .shp file is usually the file we want to import.

We can import that shapefile into R as an sf object using the st_read function.

  • Typically, we want to provide the shp file to the dsn argument (data source name) of st_read.

The file cb_2023_us_state_20m.zip containing a shapefile of the U.S. states has already been downloaded and unzipped into the /data folder in our current working directory which /cloud/project.

  • The current working directory is the location to which R currently reads or saves files.
  • We can learn what our current working directory is by running getwd() in the Console.
  • If we want to change the working directory for all code chunks, we can set it via a setup code chunk in the beginning of our document.
  • If we we want to change the working directory for a single code chunk, we can use the setwd("path") function, where path is the directory path we want to set as our working directory.

In our case, we want to read the file cb_2023_us_state_20m.shp in the cb_2023_us_state_20m folder in the data folder of our current working directory.

We can run the following command to import the desired shapefile.

  • The dsn argument “data/cb_2023_us_state_20m/cb_2023_us_state_20m.shp tells R to look for the cb_2023_us_state_20m.shp file in the file path data/cb_state_2023_us_state_20m.
######################################################
# if you are running RStudio locally on your computer
# you will need to download and unzip the folder 
# below to access the shape file
######################################################
#download.file("https://github.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/raw/main/Data/cb_2023_us_state_20m.zip", destfile = "./cb_2023_us_state_20m.zip", method = "auto")
#unzip("cb_2023_us_state_20m.zip", exdir = "./data/cb_2023_us_state_20m")
us_sf <- st_read(dsn = "./data/cb_2023_us_state_20m/cb_2023_us_state_20m.shp")
Reading layer `cb_2023_us_state_20m' from data source 
  `/Users/adamspiegler/Dropbox/CUDenver/Math3376/Notes/Posit-Cloud/data/cb_2023_us_state_20m/cb_2023_us_state_20m.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 52 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS:  NAD83

The st_read provides helpful information automatically when run. In this case, we learn:

  • We imported an sf object with 52 features (observations).
  • The imported sf object has 9 attributes (columns).
  • The features appear to be the simple feature geometry MULTIPOLGYON.
  • The dimension is XY, so we are working with 2-dimensional data.
  • The bounding box tells us the largest and smallest y-coordinates.
  • The CRS is NAD83.

We use the class function to see the classes of us_sf.

class(us_sf)
[1] "sf"         "data.frame"

us_sf is a data.frame that has been extended to the sf class.

Let’s use the str function to learn more about the structure of us_sf.

str(us_sf)
Classes 'sf' and 'data.frame':  52 obs. of  10 variables:
 $ STATEFP : chr  "48" "06" "21" "13" ...
 $ STATENS : chr  "01779801" "01779778" "01779786" "01705317" ...
 $ GEOIDFQ : chr  "0400000US48" "0400000US06" "0400000US21" "0400000US13" ...
 $ GEOID   : chr  "48" "06" "21" "13" ...
 $ STUSPS  : chr  "TX" "CA" "KY" "GA" ...
 $ NAME    : chr  "Texas" "California" "Kentucky" "Georgia" ...
 $ LSAD    : chr  "00" "00" "00" "00" ...
 $ ALAND   : num  6.77e+11 4.04e+11 1.02e+11 1.49e+11 1.40e+11 ...
 $ AWATER  : num  1.90e+10 2.03e+10 2.38e+09 4.42e+09 2.93e+10 ...
 $ geometry:sfc_MULTIPOLYGON of length 52; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:551, 1:2] -107 -107 -107 -107 -106 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
  ..- attr(*, "names")= chr [1:9] "STATEFP" "STATENS" "GEOIDFQ" "GEOID" ...

us_sf has 52 rows and 10 columns. The (most) useful columns are:

  • STUSPS: the state abbreviation
  • NAME: the state name
  • ALAND: the land area of each state
  • AWATER: the water area of each state

The geometry column is the simple feature geometry list column and contains the geometry object for each observation.

Wrangling sf Objects


An sf object is a type of data frame, similar to how a tibble is a type of data frame. Both classes extend the data.frame class.

  • This means that we can work with sf objects similarly to how we would work with a data.frame, though the default behaviors may be different.
  • We can use the dplyr package or similar tools to manipulate an sf object.

We can select columns of the us_sf sf object in the following ways:

# a vector
us_sf$STUSPS
 [1] "TX" "CA" "KY" "GA" "WI" "OR" "MO" "VA" "TN" "LA" "NY" "ID" "FL" "IL" "MT"
[16] "MN" "MD" "IA" "DC" "WA" "SD" "OH" "NE" "IN" "MA" "NV" "ND" "AR" "MS" "CO"
[31] "NC" "UT" "OK" "WY" "WV" "SC" "ME" "HI" "AL" "KS" "RI" "CT" "MI" "AK" "DE"
[46] "NM" "PR" "VT" "NJ" "PA" "NH" "AZ"
# these are all subsets of an sf object which is an sf object
#us_sf[,5]
#us_sf[,"STUSPS"]
#us_sf["STUSPS"]
us_sf |> select(STUSPS)
Simple feature collection with 52 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS:  NAD83
First 10 features:
   STUSPS                       geometry
1      TX MULTIPOLYGON (((-106.6234 3...
2      CA MULTIPOLYGON (((-118.594 33...
3      KY MULTIPOLYGON (((-89.54443 3...
4      GA MULTIPOLYGON (((-85.60516 3...
5      WI MULTIPOLYGON (((-86.93428 4...
6      OR MULTIPOLYGON (((-124.5524 4...
7      MO MULTIPOLYGON (((-95.76565 4...
8      VA MULTIPOLYGON (((-76.02348 3...
9      TN MULTIPOLYGON (((-90.3007 35...
10     LA MULTIPOLYGON (((-94.04305 3...

Note that the $ operator extracts the column from us_sf (returning a character vector), while the other choices subset us_sf and remain sf objects (i.e., the geometry list column is retained).

We can select rows in a similar fashion.

#us_sf[2:3,]
#us_sf[us_sf$STUSPS == "CO",]
#us_sf[startsWith(us_sf$STUSPS, "C"),]
us_sf |> filter(startsWith(us_sf$STUSPS, "C"))
Simple feature collection with 3 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -71.78936 ymax: 42.04964
Geodetic CRS:  NAD83
  STATEFP  STATENS     GEOIDFQ GEOID STUSPS        NAME LSAD        ALAND
1      06 01779778 0400000US06    06     CA  California   00 403673296401
2      08 01779779 0400000US08    08     CO    Colorado   00 268418756810
3      09 01779780 0400000US09    09     CT Connecticut   00  12541750274
       AWATER                       geometry
1 20291770234 MULTIPOLYGON (((-118.594 33...
2  1185758065 MULTIPOLYGON (((-109.06 38....
3  1816364426 MULTIPOLYGON (((-73.72777 4...

Note that startsWith is a base R function that finds the elements of a character vector that start with a certain set of characters while start_with is a dplyr function that is used to select columns of a data frame based on a pattern.

A really neat feature of sf objects is that can use a spatial object to select rows. Let’s extract the “Colorado” row from us_sf.

co <- us_sf[us_sf$NAME == "Colorado",]
class(co)
[1] "sf"         "data.frame"

If we pass the co object as the row argument inside the square brackets, [ ], then the rows of us_sf with geometry objects that intersect the co geometry object will be returned

(co_intersects <- us_sf[co,])
Simple feature collection with 8 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -114.8142 ymin: 31.33224 xmax: -94.43152 ymax: 45.0059
Geodetic CRS:  NAD83
   STATEFP  STATENS     GEOIDFQ GEOID STUSPS       NAME LSAD        ALAND
23      31 01779792 0400000US31    31     NE   Nebraska   00 198949602728
30      08 01779779 0400000US08    08     CO   Colorado   00 268418756810
32      49 01455989 0400000US49    49     UT       Utah   00 213921849163
33      40 01102857 0400000US40    40     OK   Oklahoma   00 177664484361
34      56 01779807 0400000US56    56     WY    Wyoming   00 251458162746
40      20 00481813 0400000US20    20     KS     Kansas   00 211753777631
46      35 00897535 0400000US35    35     NM New Mexico   00 314198587197
52      04 01779777 0400000US04    04     AZ    Arizona   00 294366106734
       AWATER                       geometry
23 1379309601 MULTIPOLYGON (((-104.0531 4...
30 1185758065 MULTIPOLYGON (((-109.06 38....
32 5963196691 MULTIPOLYGON (((-114.0525 3...
33 3373395450 MULTIPOLYGON (((-103.0025 3...
34 1868053273 MULTIPOLYGON (((-111.0569 4...
40 1345707708 MULTIPOLYGON (((-102.0517 4...
46  726463919 MULTIPOLYGON (((-109.0492 3...
52  854003932 MULTIPOLYGON (((-114.7997 3...
plot(co_intersects["NAME"])

plot(st_geometry(co_intersects))

Question 2


Create a plot of all states the border Colorado or California.

Solution to Question 2


# create plot of all states that border Colorado or California

The base merge and dplyr *_join functions can be used to merge a data frame with an sf object.

  • The sf object must be the first argument of these functions.

Let’s access a COVID-19 related data frame available in the bayesutils package, which can be installed from GitHub.

  • We install the package from GitHub using the remote::install_github function.
    • Be sure to install the remotes package if you don’t have it.
  • We then load the covid_20210307 data set from the bayesutils package.
#######################################################
# run each of the commands below in the Console below.
#######################################################
#install.packages("remotes")
#remotes::install_github("jfrench/bayesutils")
data("covid_20210307", package = "bayesutils")
str(covid_20210307)
'data.frame':   50 obs. of  9 variables:
 $ state_name    : chr  "Alaska" "Alabama" "Arkansas" "Arizona" ...
 $ state_abb     : chr  "AK" "AL" "AR" "AZ" ...
 $ deaths        : num  305 10148 5319 16328 54124 ...
 $ cases         : num  56886 499819 324818 826454 3501394 ...
 $ population    : num  738516 4864680 2990671 6946685 39148760 ...
 $ income        : num  35455 25734 25359 29348 31086 ...
 $ hs            : num  91 82.1 82.9 85.6 80.7 89.7 88.6 87.7 85.5 84.3 ...
 $ bs            : num  27.9 21.9 19.5 25.9 30.1 36.4 35.5 27.8 25.8 27.3 ...
 $ vote_diff_2020: num  -10.5 -25.8 -28.4 0.3 29.8 13.9 20.4 19.3 -3.4 0.2 ...
 - attr(*, "na.action")= 'omit' Named int [1:6] 4 9 13 28 43 51
  ..- attr(*, "names")= chr [1:6] "4" "9" "13" "28" ...

The state_abb column of covid_20210307 has the state abbreviations and matches the STUSPS column of us_sf.

We use the base::merge function to unite these two objects into a new object, covid_us.

covid_us <- merge(us_sf, covid_20210307,
                  by.x = "STUSPS", by.y = "state_abb")
head(covid_us)
Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1743 ymin: 30.22831 xmax: 179.7739 ymax: 71.35256
Geodetic CRS:  NAD83
  STUSPS STATEFP  STATENS     GEOIDFQ GEOID       NAME LSAD        ALAND
1     AK      02 01785533 0400000US02    02     Alaska   00 1.479017e+12
2     AL      01 01779775 0400000US01    01    Alabama   00 1.311850e+11
3     AR      05 00068085 0400000US05    05   Arkansas   00 1.346605e+11
4     AZ      04 01779777 0400000US04    04    Arizona   00 2.943661e+11
5     CA      06 01779778 0400000US06    06 California   00 4.036733e+11
6     CO      08 01779779 0400000US08    08   Colorado   00 2.684188e+11
        AWATER state_name deaths   cases population income   hs   bs
1 245347100126     Alaska    305   56886     738516  35455 91.0 27.9
2   4582326383    Alabama  10148  499819    4864680  25734 82.1 21.9
3   3122251184   Arkansas   5319  324818    2990671  25359 82.9 19.5
4    854003932    Arizona  16328  826454    6946685  29348 85.6 25.9
5  20291770234 California  54124 3501394   39148760  31086 80.7 30.1
6   1185758065   Colorado   5989  436602    5531141  35053 89.7 36.4
  vote_diff_2020                       geometry
1          -10.5 MULTIPOLYGON (((179.4813 51...
2          -25.8 MULTIPOLYGON (((-88.46866 3...
3          -28.4 MULTIPOLYGON (((-94.61792 3...
4            0.3 MULTIPOLYGON (((-114.7997 3...
5           29.8 MULTIPOLYGON (((-118.594 33...
6           13.9 MULTIPOLYGON (((-109.06 38....

Alternatively, we could have used a dplyr *_join function. We’ll use full_join. Note the special syntax in the by argument to address the fact that we want to merge the rows based on different columns in the data frames.

covid_us2 <- full_join(us_sf, covid_20210307, by = c("STUSPS" = "state_abb"))
head(covid_us2)
Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.5524 ymin: 25.84012 xmax: -80.84313 ymax: 47.05468
Geodetic CRS:  NAD83
  STATEFP  STATENS     GEOIDFQ GEOID STUSPS       NAME LSAD        ALAND
1      48 01779801 0400000US48    48     TX      Texas   00 676686238592
2      06 01779778 0400000US06    06     CA California   00 403673296401
3      21 01779786 0400000US21    21     KY   Kentucky   00 102266598312
4      13 01705317 0400000US13    13     GA    Georgia   00 149485311347
5      55 01779806 0400000US55    55     WI  Wisconsin   00 140292627460
6      41 01155107 0400000US41    41     OR     Oregon   00 248630419895
       AWATER state_name deaths   cases population income   hs   bs
1 18982083586      Texas  44451 2686818   27885195  30069 80.7 25.9
2 20291770234 California  54124 3501394   39148760  31086 80.7 30.1
3  2384223544   Kentucky   4819  410709    4440204  26174 81.9 20.5
4  4419673221    Georgia  17906 1023487   10297484  28838 84.3 27.3
5 29343084365  Wisconsin   7106  621654    5778394  31604 90.1 26.3
6  6168960338     Oregon   2296  157079    4081943  29696 88.8 28.8
  vote_diff_2020                       geometry
1           -5.7 MULTIPOLYGON (((-106.6234 3...
2           29.8 MULTIPOLYGON (((-118.594 33...
3          -26.4 MULTIPOLYGON (((-89.54443 3...
4            0.2 MULTIPOLYGON (((-85.60516 3...
5            0.6 MULTIPOLYGON (((-86.93428 4...
6           16.6 MULTIPOLYGON (((-124.5524 4...

If a new row is added to the sf object without a corresponding geometry, then an empty geometry object is added for that row.

Plotting Simple Features


The power of the sf package is the ability to easily create plots of spatial data.

Plotting sfc Objects


To simply plot the geometry list column of an sf object, we can use:

  • st_geometry to extract the list column of simple feature geometries from the sf object (this will be an object of class sfc).
  • plot to plot the sfc object.
plot(st_geometry(us_sf))

Alternatively, we can directly extract the sfc component of the sf object using $, then plot the sfc object.

plot(us_sf$geometry)

We can easily change aspects of the plotted sfc object (or an sf) object using the standard arguments:

  • axes can be set to TRUE to show the axes
  • xlab and ylab will change the axis labels
  • xlim and ylim can be used to constrain the plotting regions.
    • Note that “W” longitude values are indicated using negative numbers, while “E” longitude values are positive numbers.
    • Note that “N” latitude values in the northern hemisphere are positive numbers. “S” latitude values in the southern hemisphere are negative numbers.

Each geometry has specific arguments that the user can change (consider looking at the documentation for ?sf::plot.sf for details). In this case, we can change the fill color of the MULTIPOLYGON objects using the color argument and the border using the border argument.

plot(us_sf$geometry, axes = TRUE,
     xlab = "longitude", ylab = "latitude",
     col = "grey", border = "blue",
     xlim = c(-125, -75),
     ylim = c(22, 50))

We can change the colors of the individual geometry objects with a little creativity. Let’s color all the “C states” (California, Colorado, Connecticut) a little differently than the other states.

# determine indices of C states
c_states <- startsWith(us_sf$STUSPS, "C")
# create a character vector of replicated "grey"
# values matching the number of rows in us_sf
mycol <- rep("grey", nrow(us_sf))
#change "grey" to "orange" for the c_states indices
mycol[c_states] <- "orange"
# create a character vector of replicated "yellow"
# values matching the number of rows in us_sf
myborder <- rep("yellow", nrow(us_sf))
#change "yellow" to "blue" for the c_states indices
myborder[c_states] <- "blue"
# create the customized plot
plot(st_geometry(us_sf), col = mycol, border = myborder, xlim = c(-125, -75),
     ylim = c(22, 50))

Customize Color for Subset of sf Object


Let’s look at a different approach to coloring a subset of the sf object differently from the rest of the sf object.

# draw all geometries
plot(st_geometry(us_sf), col = "grey", border = "yellow",
     xlim = c(-125, -75), ylim = c(22, 50))
# plot subset of all geometries with different colors
# add = TRUE overlays new drawing on current graphic
plot(st_geometry(us_sf)[c_states],
     col = "orange", border = "blue",
     add = TRUE)

Plotting Attributes of an sf Object


By default, use we can use the plot function to plot all the attributes of an sf object. In general, this isn’t very useful.

plot(covid_us)

More likely, we will want to plot a single attribute (variable), which can be done by subsetting that variable in the sf object and plotting the subsetted object.

Let’s plot the land area of the states, excluding Alaska and Hawaii. First, we exclude the Alaska and Hawaii rows (and save the filtered object).

covid_us <- covid_us |> filter(!is.element(STUSPS, c("AK", "HI")))
plot(covid_us["ALAND"])

Land area is directly related to the area of the state. It would be interesting to visualize the states that have the greatest percentage of water. Let’s create a new variable in covid_us that computes the percentage of the state that is water. We’ll then plot this variable (excluding Alaska and Hawaii)

covid_us <- covid_us |> mutate(prop_water = AWATER/(AWATER + ALAND))
plot(covid_us["prop_water"])

Not surprisingly, coastal states and states on the Great Lakes have the highest percentage of water.

We can use the following commands to identify the states with the 6 largest proportions of water.

covid_us |> slice_max(prop_water, n = 6) |> select(NAME, prop_water)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -90.41814 ymin: 24.49813 xmax: -69.92826 ymax: 48.21065
Geodetic CRS:  NAD83
           NAME prop_water                       geometry
1      Michigan  0.4146584 MULTIPOLYGON (((-84.61622 4...
2  Rhode Island  0.3308018 MULTIPOLYGON (((-71.63147 4...
3 Massachusetts  0.2608631 MULTIPOLYGON (((-70.27553 4...
4       Florida  0.2485776 MULTIPOLYGON (((-81.81169 2...
5      Maryland  0.2172144 MULTIPOLYGON (((-76.04621 3...
6      Delaware  0.2170690 MULTIPOLYGON (((-75.77379 3...

We can change the number of breaks in our color bar via the nbreaks.

We can change the colors in our color bar by specifying the pal argument.

  • The pal argument takes a function that, when given n, the number of desired colors, returns a vector of colors.
  • The hcl.colors, rainbow, heat.colors, terrain.colors, topo.colors, and cm.colors are all color palettes in base R that can be used to change the colors palettes.
  • The hcl.colors palette is particularly good, as it includes color palettes viridis and cividis (corrected viridis) that are particularly well-suited to displaying colors that are color-blind and can be understood when printed in black and white.
  • The colorspace package also has a wide variety of palettes we might consider.

However, the hcl.colors function has a palette argument to specify the desired palette. To use an hcl.color palette when plotting an sf object, we need to build a custom palette that only requires the number of colors desired. We create those function for the viridis and cividis palettes below. We then see that it produces the desired results.

plot(covid_us["prop_water"], pal=terrain.colors, nbreaks = 5)

viridis_pal <- function(n) hcl.colors(n, palette = "viridis")
cividis_pal <- function(n) hcl.colors(n, palette = "cividis")
# dirty approach to see colors in the palette
image(matrix(0:4), col = viridis_pal(5))

image(matrix(0:4), col = cividis_pal(5))

Let’s do some analysis of the actual COVID-19 data. Let’s create a new column for death_rate_100k, which is the number of confirmed and probable COVID-19 deaths per 100,000 persons. Let’s display the death rate using the viridis palette and then the cividis palette.

covid_us <- covid_us |> mutate(death_rate_100k = deaths/population*100000)
plot(covid_us["death_rate_100k"],
     nbreaks = 5,
     pal = viridis_pal)

plot(covid_us["death_rate_100k"],
     nbreaks = 5,
     pal = cividis_pal)

Plotting sf Objects with ggplot2


The ggplot2 package includes a geometry for sf objects, geom_sf, the is typically adequate for using ggplot2 to produce graphics for sf objects.

  • The sf object is passed as the data argument to the ggplot function.
  • geom_sf is the geometry of the data
  • The XY coordinates of the sf object are automatically mapped to x and y aesthetics.
  • The color, linetype, fill, etc., of the geometry objects can be changed by mapping an attribute of the sf object to the appropriate aesthetic.

If we only want to plot the geometry list column (i.e., the geometry objects) of each observation, then we don’t have to specify any aesthetics.

ggplot(covid_us) + geom_sf()

We will create a choropleth map of the covid_us data.

  • A choropleth map is a map of regions colored to indicate the level of some variable associated with the regions.
ggplot(covid_us, aes(fill = death_rate_100k)) + 
  geom_sf()

Changing the color palette for our fill aesthetic to a viridis palette.

ggplot(covid_us, aes(fill = death_rate_100k)) + 
  geom_sf() +
  colorspace::scale_fill_continuous_sequential(palette = "viridis")

Change the color palette to viridis using ggplot2’s built-in function.

ggplot(covid_us, aes(fill = death_rate_100k)) + 
  geom_sf() +
  scale_fill_viridis_c(option = "viridis")

Reverse the order of the colors.

ggplot(covid_us, aes(fill = death_rate_100k)) + 
  geom_sf() +
  scale_fill_viridis_c(option = "viridis", direction = -1)

Use the cividis palette using the scale_fill_viridis function from the viridis package.

ggplot(covid_us, aes(fill = death_rate_100k)) + 
  geom_sf() +
  viridis::scale_fill_viridis(option = "E")

Additional Resources


sf help and tutorials from the package authors https://r-spatial.github.io/sf/

Geocomputation with R by Robin Lovelace https://r.geocompx.org/. This book covers tons of aspect of spatial data (not just from an R perspective, but general theory and concepts). This will help you to learn a lot more about representing and working with spatial data in general, not just working with it in R.

Spatial Data Science with applications in R by Edzer Pebesma and Roger Bivand https://r-spatial.org/book/. Also covers much much about representing spatial data than you probably ever thought possible! The authors are the main creators of the sf package.

Appendix: Color Palettes Available with base R


Appendix: Color Swatches for HCL Palettes